EHR-Safe: generating high-fidelity and privacy-preserving synthetic electronic health records

Privacy concerns often arise as the key bottleneck for the sharing of data between consumers and data holders, particularly for sensitive data such as Electronic Health Records (EHR). This impedes the application of data analytics and ML-based innovations with tremendous potential. One promising approach to address such privacy concerns is to use synthetic data instead. We propose a generative modeling framework, EHR-Safe, for generating highly realistic and privacy-preserving synthetic EHR data. EHR-Safe is based on a two-stage model that consists of sequential encoder-decoder networks and generative adversarial networks. Our innovations focus on the key challenging aspects of real-world EHR data: heterogeneity, sparsity, coexistence of numerical and categorical features with distinct characteristics, and time-varying features with highly-varying sequence lengths. Across numerous evaluations, we demonstrate that the fidelity of EHR-Safe is almost identical to that of real data (<3% accuracy difference for models trained on them) while yielding almost-ideal performance on practical privacy metrics.


Utility metric
The utility metric focuses on the usefulness of the synthetic data for a given task. One common use case of synthetic data is developing predictive models on them without access to the real data. The ideal scenario is that synthetic and real data have sufficiently similar characteristics that they yield similar models (when the same model development procedure is applied to them), and eventually similar predictions on unseen real data (please see [1] for a more comprehensive discussion). It should be noted that the choice of training data, features to predict, and models to train can all have a substantial impact on observed utility. For instance, low-capacity or poorly-tuned models may result in low accuracy regardless of the quality of the synthetic data. Throughout this paper, we present numerous downstream model performance results showing that the real data can be replaced with synthetic data with minimal performance penalty.
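A minimal sketch of this train-on-synthetic, test-on-real comparison, assuming flat feature matrices and a binary label (data loading and the model class are placeholders, not the authors' exact setup):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

def tstr_utility(X_real_tr, y_real_tr, X_syn, y_syn, X_real_te, y_real_te):
    """Compare a model trained on real data with one trained on synthetic data,
    both evaluated on the same unseen real test set."""
    model_real = RandomForestClassifier(n_estimators=200, random_state=0)
    model_real.fit(X_real_tr, y_real_tr)
    model_syn = RandomForestClassifier(n_estimators=200, random_state=0)
    model_syn.fit(X_syn, y_syn)
    auc_real = roc_auc_score(y_real_te, model_real.predict_proba(X_real_te)[:, 1])
    auc_syn = roc_auc_score(y_real_te, model_syn.predict_proba(X_real_te)[:, 1])
    return auc_real, auc_syn  # a small gap indicates high utility
```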

Multi-target utility metric
The utility metric is inherently limited as a measure of the realism of synthetic data because, in the general case, similarity of model performance does not strictly imply similarity of the underlying training data. For instance, the target feature may be completely unrelated to the rest of the dataset, leading to equally poor performance regardless of which training data are used. Instead, the utility metric attempts to measure the preservation of predictively useful statistical properties of the underlying data. However, measuring utility with respect to only a single target variable may fail to capture important aspects of the data. Additionally, considering only the scenario where all features are available for prediction may fail to capture the utility of features of lesser importance for the predictive task under consideration.
To address these limitations, we propose a framework to evaluate utility more comprehensively. In order to validate that all features are well preserved in the synthetic data, we can compute utility using every possible subset of features as predictors (instead of using all features). If, for each possible subset of features, a common model trained on both datasets always makes similar predictions, then we can make stronger claims about the usefulness of the synthetic dataset. To further validate the utility metric under diverse settings, we can measure performance when predicting any feature (excluded from the predictor set) instead of fixing on a single one such as mortality. Here, we focus on predicting only static categorical features.
There are 2^n subsets of features, which makes running the utility metric on every possible subset computationally infeasible. Instead, we use a hypothesis-testing framework and work with a confidence level. This keeps the approach computationally feasible while providing a confidence level for our results.
We state our null hypothesis (H0) as follows: the mean of the absolute difference between the models trained on real and synthetic data, measured using metric M when predicting any categorical static feature F with model P using any set of features, is greater than or equal to X.
We use a Random Forest (RF) as the model P due to its high accuracy and relatively fast training. In each experiment, we randomly choose the target feature F to predict among the available categorical static variables (mortality, gender, condition code, marital status, and religion). We also choose a random subset of features to use for prediction, excluding the feature F. We run the experiment n = 30 times and compute the Area Under the Receiver Operating Characteristic Curve (AUC) as our metric M. Then, we perform a statistical test of whether the mean of the underlying distribution of the sample (i.e., the n absolute differences between training on real and training on synthetic data) is greater than the given population mean (i.e., the threshold used in our hypothesis, X/100). We did this only for the MIMIC-III dataset, as for the eICU dataset only two target variables were available (mortality and gender), both of which we used for the results in the main manuscript. The mean of the differences was 0.057. For X = 6, the p-value (computed by a one-sample t-test) to reject the hypothesis was 0.052.
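A minimal sketch of this test, assuming the 30 absolute AUC differences are collected in an array (auc_differences is a placeholder name); rejecting H0 (mean >= X/100) corresponds to a one-sided alternative of "less" in scipy:

```python
import numpy as np
from scipy import stats

# Placeholder: n = 30 absolute AUC differences between models trained on real
# vs. synthetic data, one per random (target feature, predictor subset) draw.
diffs = np.asarray(auc_differences)

X = 6  # threshold in percent; H0: mean(diffs) >= X / 100
t_stat, p_value = stats.ttest_1samp(diffs, popmean=X / 100, alternative="less")
# A small p_value lets us reject H0, i.e., conclude the mean difference < X/100.
```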

Privacy metrics
In this section, we give an overview of the privacy attacks against the model. We choose three privacy metrics that represent known approaches adversaries may apply to de-anonymize private data: (i) membership inference, (ii) re-identification, and (iii) attribute inference. These metrics are highly practical, as they represent the risks that currently prevent sharing of conventionally anonymized data. Furthermore, they are highly interpretable, as the results for these metrics directly measure the risks associated with sharing synthetic data. For instance, membership inference is a common attack for extracting protected attributes from private data. The ability of attackers to identify patients within a dataset also gives access to any sensitive attributes within the data. Thus, demonstrating resilience to membership inference attacks is a necessary condition for creating synthetic medical records that are safe to share. Furthermore, this metric can evaluate whether the generative model merely memorizes the training data or learns the distributions of the original data, which is critical for privacy. Metrics (ii) and (iii) concern whether private attributes can be identified when some non-private attributes are revealed. These are also highly practical and easy-to-interpret privacy metrics.
Our assumption for these attacks is that the adversary may be in possession of either the original data or the synthetically generated data. The attacks below cover scenarios where the adversary only has access to the data rather than the model; they do not cover insider-threat cases where the model is in the hands of the malicious party.

Membership inference attack
The adversary's goal with this attack is to determine whether an individual's data has been used for training the synthetic data generation model. In this case, the specific attribute is whether an individual has participated in a medical study. To evaluate the risk of this attack, we first divide the original data into training (50%) and holdout (50%) splits, and train EHR-Safe using only the training split. After generating the synthetic data, we fit a k-nearest-neighbor (kNN) model on the synthetic data. We assume that the adversary is in possession of real data containing both training and holdout samples. Using the kNN model, we identify the closest synthetic neighbor for each sample in the real dataset. Using a distance threshold (e.g., on the minimum Hamming distance), we predict whether the real data sample belongs to the training data. Then, we calculate the accuracy, since we are in possession of the labels (unlike a real attacker). For an ideal model (without privacy risk), the prediction accuracy would be 50%. If EHR-Safe had high privacy leakage, the kNN model would yield higher accuracy.
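A minimal sketch of this evaluation, assuming binary/categorical feature matrices and a pre-chosen distance threshold (both placeholders):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def membership_inference_accuracy(synthetic, real_train, real_holdout, threshold):
    """Predict 'member' when a real sample lies within `threshold` of its
    nearest synthetic neighbor; accuracy near 0.5 indicates no leakage."""
    nn = NearestNeighbors(n_neighbors=1, metric="hamming").fit(synthetic)
    d_train, _ = nn.kneighbors(real_train)
    d_hold, _ = nn.kneighbors(real_holdout)
    preds = np.concatenate([d_train, d_hold]).ravel() <= threshold
    labels = np.concatenate([np.ones(len(real_train)), np.zeros(len(real_holdout))])
    return float(np.mean(preds == labels))
```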

Re-identification attack
The re-identification attack is a linkage attack that analyzes whether, if a certain subset of features suggests that a synthesized sample belongs to a certain individual, the same suggestion also holds for another subset of features of the same sample. We define the re-identification ratio, based on feature-subset proximity, as a measure of robustness against this linkage attack. We first divide the synthetically generated dataset into two subsets based on the features, simply using half of the features in each subset. Then, we find the nearest neighbors of each sub-dataset in the original dataset to obtain a one-to-one mapping between synthetic and original data. Finally, we check whether these one-to-one mappings are consistent between the two subsets.
If they are consistent, we treat that sample as re-identifiable. The optimal value of this metric (i.e., no privacy risk) can be computed by replacing the synthetic data with a disjoint holdout split of the original data.
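A minimal sketch of this consistency check, assuming flat numeric feature matrices and a simple half-and-half feature split:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def reidentification_ratio(synthetic, original):
    """Match each synthetic sample to its nearest original sample using two
    disjoint halves of the features; count consistent matches."""
    half = synthetic.shape[1] // 2
    matches = []
    for cols in (slice(None, half), slice(half, None)):
        nn = NearestNeighbors(n_neighbors=1).fit(original[:, cols])
        _, idx = nn.kneighbors(synthetic[:, cols])
        matches.append(idx.ravel())
    # Fraction of synthetic samples mapped to the same original individual
    # by both feature subsets (treated as re-identifiable).
    return float(np.mean(matches[0] == matches[1]))
```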

Attribute inference attack
For this attack, the adversary has partial information about some individuals; by correlating that information with the synthetic data, the attack analyzes the adversary's ability to infer specific attributes more accurately. We focus on gender, age, and race as the sensitive features in our experiments. Then, using the rest of the features, we assess how predictable the values of these sensitive features are. As the baseline, we consider predicting these sensitive features using the original data.
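A minimal sketch under simplifying assumptions: a k-NN classifier as an illustrative attacker and a binary sensitive attribute (the attacker model is not specified in this section, so this is only one possible instantiation):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_auc_score

def attribute_inference_auc(reference, original, sensitive_col, k=5):
    """Train an attacker on `reference` (synthetic data for the attack,
    original data for the baseline) to predict a binary sensitive feature
    from the remaining features, and evaluate on the original data."""
    X_ref = np.delete(reference, sensitive_col, axis=1)
    y_ref = reference[:, sensitive_col]
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_ref, y_ref)
    X_orig = np.delete(original, sensitive_col, axis=1)
    scores = clf.predict_proba(X_orig)[:, 1]
    return roc_auc_score(original[:, sensitive_col], scores)
```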

Supplementary Methods: Post-processing to minimize distribution distance
When the fidelity metric based on the KS-statistic is considered, further improvements can be obtained with a post-processing procedure that refines the distributions. We propose a post-processing method that is applied to each feature individually to optimize the resulting KS-statistic between the original and synthetic distributions of that feature.
In order to understand the method, we start with a brief overview of the KS-statistic computation:
• Samples from both datasets are concatenated and sorted;
• For each observation, estimate the empirical CDF of the original and synthetic data, as well as their absolute difference;
• Compute the survival function over the maximum difference.
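A minimal sketch of the two-sample KS statistic described above (it matches the statistic returned by scipy.stats.ks_2samp):

```python
import numpy as np

def ks_statistic(original, synthetic):
    """Maximum absolute difference between the two empirical CDFs,
    evaluated at the pooled, sorted observations."""
    pooled = np.sort(np.concatenate([original, synthetic]))
    cdf_o = np.searchsorted(np.sort(original), pooled, side="right") / len(original)
    cdf_s = np.searchsorted(np.sort(synthetic), pooled, side="right") / len(synthetic)
    return np.max(np.abs(cdf_o - cdf_s))
```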
The goal of the proposed post-processing method is to define a value dist_max as the maximum acceptable CDF difference between synthetic and real data, and to find a minimal set of changes to the synthetic dataset such that the dist_max criterion is satisfied. Let S_o and S_s denote the sorted original and synthetic values of a feature, and let count(S, x) denote the number of elements of S that are less than or equal to x. For each observed value x, the constraint requires that the empirical CDFs of S_s and S_o differ at x by at most dist_max after the optimization; let N be the corresponding target for count(S_s, x). We want to minimize the number of modifications to S_s while satisfying these inequalities, so we only modify elements until the counts match; further modifications may also satisfy the inequality but would require more changes. Therefore, we need to modify S_s such that count(S_s, x) = N. There are many possible ways to modify S_s so that its first N values are less than or equal to x and the remaining values are greater than x. We propose the following approach:
1. For the first N values of S_s, replace any value greater than x with x.
2. For the remaining |S_s| − N values, replace every value that is less than or equal to x with the value y, where y is the smallest value in S_s that is greater than x.
Note that we always use replacement values that already exist in S_s or S_o. This avoids introducing values that are nonexistent in the given sets. We use dist_max = 0 for all features. As Fig. 1 shows, the KS-statistics can improve drastically with the proposed procedure.
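A minimal sketch of one replacement step, assuming sorted synthetic values and a target count N derived from the original data's CDF at x (with dist_max = 0):

```python
import numpy as np

def enforce_count(s_sorted, x, N):
    """Modify the sorted synthetic values so that exactly N of them are <= x,
    following the two replacement rules above."""
    s = s_sorted.copy()
    # Smallest value of the original S_s that is greater than x (used in step 2).
    above = s_sorted[s_sorted > x]
    # Step 1: in the first N positions, replace any value greater than x with x.
    s[:N] = np.minimum(s[:N], x)
    # Step 2: in the remaining |S_s| - N positions, replace values <= x with y.
    if above.size:
        y = above.min()
        tail = s[N:]  # view into s; the assignment below writes through
        tail[tail <= x] = y
    return s
```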

Supplementary Methods: Listwise feature generation
Listwise features are another common data type in EHR data, representing a list of components at a single measurement time. For instance, a patient can have multiple ICD-9 or ICD-10 codes at a certain time point; these are represented as listwise features. More examples can be found in Fig. 2. Generating synthetic listwise features is therefore important for producing realistic synthetic healthcare data.
Fortunately, with small modifications, EHR-Safe can generate synthetic listwise features. We describe the details of the modified EHR-Safe framework to generate listwise features in the following sections.

Data preprocessing
The number of values in each listwise feature per patient can be different. We first aggregate those multiple components and convert them into a numerical (binary) matrix.
As can be seen in Fig. 3 (whose rows are indexed by patient id and measurement time), the number of columns of the converted matrix equals the number of unique components in the listwise features.
Each row represents a unique patient id and measurement time. Presence/absence of a component is represented as 1/0 in the converted numerical matrix. This is very similar to categorical data preprocessing, except that listwise features can have multiple 1s in each row.
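A minimal sketch of this conversion, using scikit-learn's MultiLabelBinarizer (the codes shown are hypothetical, for illustration only):

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Each row: the list of codes observed for one (patient id, time) pair.
listwise = [
    ["ICD9:4019", "ICD9:4280"],
    ["ICD9:4280"],
    ["ICD9:4019", "ICD10:I10"],
]
mlb = MultiLabelBinarizer()
binary_matrix = mlb.fit_transform(listwise)  # shape: (rows, n_unique_components)
# Unlike one-hot encoded categorical features, a row may contain multiple 1s.
```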

Encoder-decoder framework with listwise features
After encoding the listwise features, we can incorporate the encoded features into the categorical encoder-decoder framework. As shown in Fig. 4, the overall architecture largely overlaps with the categorical encoder-decoder framework.
One small difference is that we use a sigmoid output activation function for the listwise decoder (instead of softmax), because multiple 1s are possible in each row of the converted listwise features. The embedded representations include both categorical and listwise feature information.
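A minimal sketch of such a decoder head in Keras (layer sizes are illustrative assumptions, not the authors' architecture):

```python
import tensorflow as tf

n_components = 128  # number of unique listwise components (illustrative)
embedding_dim = 32  # size of the shared embedded representation (illustrative)

# Sigmoid activations allow several components to be active at once,
# unlike a softmax over mutually exclusive categories.
listwise_decoder = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(embedding_dim,)),
    tf.keras.layers.Dense(n_components, activation="sigmoid"),
])
loss = tf.keras.losses.BinaryCrossentropy()  # element-wise reconstruction loss
```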

Fidelity of the synthetic listwise features
To evaluate the fidelity of the synthetic listwise features, we first illustrate example synthetic listwise features for several patients. As can be seen in Fig. 5, the generated synthetic listwise features are diverse and realistic. One interesting and important point is that we generate ICD-9 and ICD-10 codes independently, yet they are exactly matched in the synthetic listwise features.
We also plot the frequencies of the listwise features in Fig. 6, which shows that the frequencies of the top 10 features for both condition code and diagnosis are well aligned.

Supplementary Methods: Training details and hyperparameters
In this section, we describe the details of EHR-Safe model training and the hyperparameters we used.

Alternative model training
In the main manuscript, we evaluate the utility and privacy performance of synthetic data generated by alternative methods. Note that these alternatives are not designed to handle various challenges of EHR data, including varying lengths of sequences, sparsity, categorical features, and joint representation of static and time-varying features.
To address those challenges with alternative methods, we introduce the following modifications.
• Varying lengths of sequences: use padding to obtain fixed-length sequences.
• Sparsity: use a missing indicator to identify the missing components.
• Categorical features: use integer encoding to convert string categories to integers. We avoid one-hot encoding due to the large number of categories per categorical feature.
• Joint representation of static and time-varying features: treat the static features as duplicated time-series features for joint modeling.
These preprocessing steps are sketched below.
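A minimal sketch combining these modifications, assuming a list of variable-length numeric sequences with NaNs marking missing values and a separate static feature matrix (names and shapes are illustrative):

```python
import numpy as np

def preprocess_for_baselines(sequences, static, max_len, pad_value=0.0):
    """Pad to a fixed length, add missingness indicators, and duplicate static
    features across time steps. Categorical features are assumed to already
    be integer-encoded."""
    n, d = len(sequences), sequences[0].shape[1]
    padded = np.full((n, max_len, d), pad_value)
    mask = np.zeros((n, max_len, d))  # 1 where a value is observed
    for i, seq in enumerate(sequences):
        t = min(len(seq), max_len)
        padded[i, :t] = seq[:t]
        mask[i, :t] = ~np.isnan(seq[:t])
    padded = np.nan_to_num(padded, nan=pad_value)
    # Duplicate static features at every time step for joint modeling.
    static_tiled = np.repeat(static[:, None, :], max_len, axis=1)
    return np.concatenate([padded, mask, static_tiled], axis=-1)
```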

Propensity scores -Distinguishing synthetic data from original data
In this subsection, we train an ad-hoc binary classifier whose objective is to distinguish real samples from synthetic samples. If the performance of this ad-hoc classifier (discriminator) is close to 0.5 (random guessing), we can claim that the synthetic data preserve the properties of the original data well. We also report propensity scores per feature to check which features are more or less realistic compared with the original features. As can be seen in Fig. 7, the discriminator performance is lower than 0.6 for most features.
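A minimal sketch of this propensity-score check, assuming flat feature matrices:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def discriminator_auc(original, synthetic):
    """AUC of a classifier separating original from synthetic samples;
    values near 0.5 indicate that the two are hard to distinguish."""
    X = np.vstack([original, synthetic])
    y = np.concatenate([np.ones(len(original)), np.zeros(len(synthetic))])
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    return cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()

# Per-feature scores: run the same check on one feature column at a time,
# e.g. discriminator_auc(original[:, [j]], synthetic[:, [j]]).
```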

Data coverage visualizations
Having similar coverage, and avoiding under-representation of certain data regimes, is crucial for synthetic data generation. We use t-SNE (t-distributed stochastic neighbor embedding) analyses to provide qualitative intuition on how well our synthetic data overlap with the original data. More specifically, t-SNE serves as a non-linear dimensionality reduction method that visualizes high-dimensional data by giving each data point a location in a two-dimensional map. As Fig. 8 shows, the coverage of the synthetic data is very similar to that of the original data. Note that for three-dimensional temporal and mask data, we first convert them into two-dimensional data, where each column represents one feature at one time point.
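A minimal sketch of this visualization (temporal data are assumed to be flattened to two dimensions beforehand, as described above):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_coverage(original, synthetic):
    """Embed original and synthetic samples jointly with t-SNE and overlay them."""
    X = np.vstack([original, synthetic])
    emb = TSNE(n_components=2, random_state=0).fit_transform(X)
    n = len(original)
    plt.scatter(emb[:n, 0], emb[:n, 1], s=4, alpha=0.4, label="original")
    plt.scatter(emb[n:, 0], emb[n:, 1], s=4, alpha=0.4, label="synthetic")
    plt.legend()
    plt.show()
```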

Algorithmic fairness analysis
In this subsection, we provide algorithmic fairness analyses for different sensitive attributes: gender, marital status, and religion for MIMIC-III; and gender for eICU. We focus on mortality prediction as the downstream task and random forest as the predictive model.
We utilize three different metrics to evaluate the algorithmic fairness of original and synthetic data (a sketch of their computation follows this list):
• Demographic parity: differences in the probability of being assigned to the positive class across the subgroups defined by the attribute;
• Equalized odds: differences in True Positive Rates (TPR) and False Positive Rates (FPR) across the subgroups defined by the attribute;
• Overall accuracy equality: performance differences (with AUC as the metric) across the subgroups defined by the attribute.
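A minimal sketch of the per-subgroup quantities underlying these metrics, assuming a binary task and that every subgroup contains both classes:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def subgroup_fairness(y_true, y_pred, y_score, group):
    """Per-subgroup quantities; the fairness metrics are their differences
    across subgroups."""
    out = {}
    for g in np.unique(group):
        m = group == g
        pos = (y_true == 1) & m
        neg = (y_true == 0) & m
        out[g] = {
            "positive_rate": float(np.mean(y_pred[m])),          # demographic parity
            "tpr": float(np.mean(y_pred[pos])),                  # equalized odds
            "fpr": float(np.mean(y_pred[neg])),                  # equalized odds
            "auc": float(roc_auc_score(y_true[m], y_score[m])),  # accuracy equality
        }
    return out
```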
More details of these algorithmic fairness metrics can be found in [4]. Fig. 9 shows that the algorithmic fairness metrics are consistent between the original and synthetic data across various subgroups.
In other words, algorithmic fairness biases across different subgroups are not amplified by the synthetic data generated by EHR-Safe compared with the original data. Fig. 10 shows the cumulative distribution function (CDF) curves with and without stochastic normalization, highlighting its key role in improving the fidelity of the synthetic data.

Additional Privacy Results
In this section, we present additional privacy attack results. For the preliminary results, we used the Euclidean distance metric for the kNN algorithms. However, since the generated data are time series, it is advisable to also consider distance metrics that account for time. We consider three time-series distance metrics for the kNN models from the tslearn package:
• Dynamic Time Warping (DTW): aligns two sequences by warping the time axis to minimize the cumulative distance between matched points;
• SoftDTW: a more advanced, smoothed version of DTW in which the difference can be computed at every point; its implementation is also much faster than that of DTW;
• Canonical Time Warping (CTW): an improved version of DTW in which the difference can be calculated in more complex scenarios involving rotation and transformation of the data over time.
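A minimal sketch of using these metrics in a kNN model via tslearn (array names are placeholders; "softdtw" can be swapped in for "dtw", and CTW support depends on the tslearn version):

```python
import numpy as np
from tslearn.neighbors import KNeighborsTimeSeries

# Time series arrays are shaped (n_samples, n_timesteps, n_features).
# synthetic_series and real_series are placeholders for the actual data.
knn = KNeighborsTimeSeries(n_neighbors=1, metric="dtw")  # or "softdtw"
knn.fit(synthetic_series)
distances, _ = knn.kneighbors(real_series)  # distance to nearest synthetic sample
# These distances replace the Euclidean ones in the membership-inference test.
```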
The results are provided in Table 4. Table 4: Privacy risk evaluation with different distance metrics (DTW, SoftDTW, CTW). For membership inference, the ideal value is random guessing (i.e., 0.5) on whether an original sample was used for training the synthetic data generation model. For re-identification, the ideal case is obtained by replacing the synthetic data with holdout original data that is disjoint from the training data. For the attribute inference attack, we set three static features (gender, race, and marital status; note that eICU only has a gender attribute) as the specific attributes and report prediction AUC. The baseline scenario is measured by performing feature prediction using the original data. For multi-class features such as marital status or religion, we compute the pairwise AUC values across all possible categories and report their average.

Statistical similarity
Fig. 11 shows the pairwise Pearson correlations between temporal numerical features (a measure of linear correlation between two sets of data, used to evaluate whether the correlations between features are well conserved). We observe almost identical heatmaps, indicating that the generated synthetic data largely conserve the original correlations. Tables 5 and 6 present the statistical similarity per feature for the MIMIC-III and eICU datasets, respectively. Fig. 12 shows the frequencies of the top 10 categories for original and synthetic data. The distributions of the static and temporal categorical features are well aligned between original and synthetic data.

Feature importance analyses
In this section, we introduce feature importance comparisons as another fidelity measure to verify that the synthetic data can preserve the important feature characteristics of the original dataset.
More specifically, we extracted the feature importance (computed by mean decrease in impurity; see https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html#feature-importance-based-on-mean-decrease-in-impurity) of two models: (i) one trained on original data and (ii) one trained on synthetic data, using Random Forest (RF) and Gradient Boosting Decision Trees (GBDT) methods. Then, we plot the top-30 ranked important features to qualitatively compare their similarity.
As can be seen in Fig. 13, the top-ranked important features are highly similar between the models trained on original data and those trained on synthetic data.
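A minimal sketch of extracting these importances with a Random Forest (the GBDT variant is analogous):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def top_k_importances(X, y, feature_names, k=30):
    """Mean-decrease-in-impurity importances from a fitted Random Forest."""
    rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
    order = np.argsort(rf.feature_importances_)[::-1][:k]
    return [(feature_names[i], rf.feature_importances_[i]) for i in order]

# Compute once on original and once on synthetic training data, then compare
# the resulting rankings qualitatively, as in Fig. 13.
```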